Abstract. Estimating frequency moments of data streams is a very well studied problem [1-3,9,12]
and tight bounds are known on the amount of space that is necessary and sufficient when the stream
is adversarially ordered. Recently, motivated by various practical considerations and applications
in learning and statistics, there has been growing interest in studying streams that are randomly
ordered [3,4,6-8,11]. In this paper we improve the previous lower bounds on the space required to
estimate the frequency moments of a randomly ordered stream.

1. Introduction 

Consider a stream $(a_1, \ldots, a_m)$ where each $a_i \in [n]$. The $k$-th frequency moment is defined as

$$F_k = \sum_{i \in [n]} f_i^k$$

where $f_i = |\{j : a_j = i\}|$. It is known that $\tilde{\Theta}(n^{1-2/k})$ space is necessary and sufficient to estimate
$F_k$ in the data-stream model when the stream is ordered adversarially [5,9]. Recently, there has
been a growing interest in understanding the data-stream model when the stream is determined 
by a set of m elements and a random permutation of these elements [3,4,6-8, 11]. Here the goal 
is to understand the amount of space and/or the number of passes required to solve a problem with
large probability where the probability is taken over both the coin flips of the algorithm and the
random permutation of the stream. For some problems, significantly fewer resources are required
in this model, e.g., it was shown that any $O(\mathrm{polylog}\, n)$-space algorithm for finding the median of
a length-$n$ stream with 9/10 probability requires $\Omega(\log n/\log\log n)$ passes in the adversarial-order
model whereas in the random-order model, $O(\log\log n)$ passes suffice. For further details and
motivation for the random-order model, including its relevance to applications in learning and
statistics, see [8,10].
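
As a concrete reading of the definition of $F_k$ above, the following minimal sketch computes a frequency moment exactly by storing every count. It is only an illustration of the quantity being estimated: it uses space linear in the number of distinct elements and is not a small-space streaming algorithm. The function name `frequency_moment` is a hypothetical helper, not something from the paper.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Return F_k = sum over i of f_i**k, where f_i counts occurrences of i."""
    counts = Counter(stream)  # f_i for every element i that appears
    return sum(f ** k for f in counts.values())

stream = [1, 3, 1, 2, 1, 3]  # f_1 = 3, f_2 = 1, f_3 = 2
print(frequency_moment(stream, 2))  # 9 + 1 + 4 = 14
```

Because $F_k$ depends only on the counts, the value is the same for every ordering of the stream; the ordering matters only for how much space an algorithm needs.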

1.1. Previous Result and Our Result. The previous best lower bound for estimating $F_k$ in
a random-order data-stream model was $\Omega(n^{1-3/k}/\log n)$. The hardness instance (based on the
unique intersection promise in the multi-party set-disjointness problem) consisted of either a) at
most $n$ elements of multiplicity one or b) $\Omega(n)$ elements of multiplicity one and an element of
multiplicity $n^{1/k}$. The bound follows by considering the communication when the elements are
partitioned uniformly at random between $P = \Theta(n^{2/k})$ players. With high probability over the
random partition, it can be shown that any one-way protocol requires sending $\Omega(n^{1-1/k})$ bits in
total and hence at least one message must require $\Omega(n^{1-3/k})$ bits of communication in a constant
round protocol [3]. This yields an $\Omega(n^{1-3/k}/\log n)$ space lower bound for the single-pass data-stream
problem, based on the observation that if the $i$-th player randomly orders their elements to
form a stream $s_i$, then the stream formed by concatenating these streams $(s_1, s_2, \ldots, s_P)$ is in
random order.

Note that the above communication bound is tight in the sense that if $P$ was $o(n^{2/k})$ then, with
large probability, in case b) at least one player would receive two identical elements while in case a)
this cannot happen since there are no duplicate elements. Our new approach sidesteps this
issue and we prove that $\tilde{\Omega}(n^{1-2.5/k})$ bits of space are required to estimate $F_k$. The best upper bound
known for estimating $F_k$ in random-order streams is the same as in the case of adversarial streams,
i.e., $\tilde{O}_\epsilon(n^{1-2/k})$. We conjecture that the actual space complexity for $F_k$ in the random-order model
is $\tilde{\Theta}(n^{1-2/k})$. In other words, we conjecture that estimating frequency moments, unlike the median problem, is just as hard
in the random-order model as it is in the adversarial-order model.
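
The tightness remark above is a birthday-paradox effect, which the following sketch makes numerical. It assumes a simplified model in which each of the $n^{1/k}$ copies of the heavy element is sent to an independently and uniformly chosen player; this is an assumption that only approximates a uniform random partition. The function name and the sample parameters are hypothetical.

```python
def prob_some_player_gets_duplicate(copies, players):
    """Probability that two of `copies` identical items land on the same player
    when each item goes to an independent, uniformly random player."""
    if copies > players:
        return 1.0  # pigeonhole: some player must receive two copies
    p_all_distinct = 1.0
    for i in range(copies):
        # the (i+1)-th copy must avoid the i players that already hold a copy
        p_all_distinct *= 1.0 - i / players
    return 1.0 - p_all_distinct

# Example with n = 10**6 and k = 3, so the heavy element has n**(1/k) = 100 copies.
for players in (10_000, 1_000, 100):  # n**(2/k) players, then fewer
    print(players, prob_some_player_gets_duplicate(100, players))
```

With $P$ on the order of $n^{2/k}$ the collision probability stays bounded away from one, and it approaches one as $P$ shrinks, which is the regime the remark describes.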

2. The New Bound 

At the heart of our proof is a reduction from $t$-party set-disjointness. An instance of this problem
consists of $t$ subsets $S_i \subseteq [N]$ where the $i$-th player knows only $S_i$. These subsets satisfy the
condition that each $j \in [N]$ appears in either 0, 1, or $t$ of the subsets. The problem is to determine
if there exists $j$ such that $j \in S_i$ for all $i \in [t]$. Furthermore, we may assume that
$|S_1| = |S_2| = \ldots = |S_t| = cN/t$ for some arbitrarily small constant $c > 0$. It was shown that any randomized
protocol (possibly using public random bits) that solves $t$-party set-disjointness with probability 2/3
requires $\Omega(N/(t \log N))$ bits of communication [5].

Our argument works by assuming the existence of an $s$-space, single-pass, data-stream algorithm
that returns a 2-approximation for $F_k$ of a $\Theta(n)$-length stream with probability 99/100 on the
assumption that the order of the stream elements is chosen uniformly at random from the set of all
orderings. We use this algorithm to construct a communication protocol for $t$-party set-disjointness
when $N = \Theta(n^{1-1/(2k)})$ and $t = \Theta(n^{1/k})$. The protocol uses $O(s n^{1/k})$ bits and we therefore deduce
that $s = \Omega(n^{1-2.5/k}/\log n)$.
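
The deduction in the last sentence can be spelled out as a short calculation. This is a sketch that takes $N = \Theta(n^{1-1/(2k)})$ and $t = \Theta(n^{1/k})$ as in the proof of Theorem 1, and uses $\log N = \Theta(\log n)$ because $N$ is polynomial in $n$; it compares the cost of the protocol with the communication lower bound for $t$-party set-disjointness.

```latex
\begin{align*}
s \cdot O(n^{1/k}) &\ge \Omega\!\left(\frac{N}{t \log N}\right)
  && \text{(protocol cost vs. communication lower bound)} \\
s &= \Omega\!\left(\frac{N}{n^{1/k}\, t \log N}\right)
   = \Omega\!\left(\frac{n^{1 - 1/(2k)}}{n^{1/k} \cdot n^{1/k} \cdot \log n}\right)
   = \Omega\!\left(\frac{n^{1 - 2.5/k}}{\log n}\right)
\end{align*}
```

The exponent follows from $1/(2k) + 2/k = 2.5/k$.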

Before we present this protocol, we present two preliminary lemmas that will be important. 

Lemma 1. Let $\mathcal{I} = \{I_1, \ldots, I_t\}$ be $t = n^{1/k}$ random sets from
$\mathrm{Cycle}_{n,w} := \{\{(i - 1 \bmod n) + 1, \ldots, (w + i - 2 \bmod n) + 1\} : i \in [n]\}$
where $w = c_1 n^{1-3/(2k)}$. For small enough $c_1$, with probability
at least 99/100,

(1) $I_{i_1} \cap I_{i_2} \cap I_{i_3} = \emptyset$ for any $i_1 < i_2 < i_3$.

(2) $|\{(i_1, i_2) : i_1 < i_2,\ I_{i_1} \cap I_{i_2} \neq \emptyset\}| < n^{1/(2k)}$.

Proof. We may assume that $w \mid n$ by adjusting $c_1$ and we partition $[n] = J_1 \cup \ldots \cup J_{n/w}$ where
$J_i = \{1 + (i-1)w, \ldots, iw\}$. For the first part, note that it suffices to bound the probability
that there exists $i$ such that $J_i$ intersects three or more of the $t$ intervals in
$\mathcal{I}$. The probability that a particular $J_i$ intersects three or more of these is at most
$\binom{t}{3}(2w/n)^3 < (2tw/n)^3$. Hence the expected number of $i \in [n/w]$ such that $J_i$ intersects three
or more of the $t$ intervals is at most

$$(n/w)(2tw/n)^3 = 8w^2 t^3/n^2 = 8c_1^2.$$

By Markov's inequality the probability that there is a $J_i$ that intersects three or more intervals is
at most $8c_1^2$. For the second part, we consider the intervals $J_i$ that overlap with two sets from $\mathcal{I}$.
The expected number of such intervals is at most

$$(n/w)\binom{t}{2}(2w/n)^2 < 4t^2 w/n = 4c_1 n^{1/(2k)}.$$

Hence, by Markov's inequality, the second condition fails with probability at most $4c_1$. □
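
The two conditions of Lemma 1 can be checked empirically on small instances. The sketch below is a hypothetical test harness, not part of the proof: it samples members of $\mathrm{Cycle}_{n,w}$ independently and uniformly (an assumption about how the random sets are drawn), then reports the largest number of intervals sharing a point and the number of intersecting pairs. All function names and the sample parameters are illustrative.

```python
import random
from itertools import combinations

def cyclic_interval(start, w, n):
    """The w consecutive elements of {1, ..., n} beginning at `start`, wrapping around."""
    return {(start - 1 + offset) % n + 1 for offset in range(w)}

def max_coverage(intervals, n):
    """Largest number of intervals that contain a common element of [n]."""
    depth = [0] * (n + 1)  # depth[x] = number of intervals containing x
    for interval in intervals:
        for x in interval:
            depth[x] += 1
    return max(depth)

def intersecting_pairs(intervals):
    """Number of unordered pairs of intervals with a non-empty intersection."""
    return sum(1 for a, b in combinations(intervals, 2) if a & b)

def sample_intervals(n, w, t, rng):
    """Draw t independent, uniformly random members of Cycle_{n,w}."""
    return [cyclic_interval(rng.randint(1, n), w, n) for _ in range(t)]

rng = random.Random(0)
intervals = sample_intervals(n=10_000, w=20, t=10, rng=rng)
# Condition (1) holds when the first number is at most 2;
# condition (2) bounds the second number.
print(max_coverage(intervals, 10_000), intersecting_pairs(intervals))
```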

Lemma 2. Consider a random subset $S \subseteq [n]$ of size $n^{1/k}$. For sufficiently small constant $c_2$, with
probability at least 99/100, for each pair of distinct $i, j \in S$, $|j - i| > c_2 n^{1-2/k}$.

Lemma 2 follows from an elementary "birthday paradox" analysis. We are now ready to prove
our main result.
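
The birthday-paradox intuition behind Lemma 2 is roughly that there are about $n^{2/k}/2$ pairs in $S$, and each pair falls within distance $g$ with probability about $2g/n$, so for $g = c_2 n^{1-2/k}$ the expected number of close pairs is about $c_2$. The following sketch estimates the failure rate by simulation; the function names, the seed, and the trial count are hypothetical choices made for illustration.

```python
import random

def min_gap(subset):
    """Smallest difference |j - i| over distinct elements i, j of the subset."""
    ordered = sorted(subset)
    return min(b - a for a, b in zip(ordered, ordered[1:]))

def estimate_failure_rate(n, size, gap, trials, seed=0):
    """Fraction of random size-`size` subsets of [n] with two elements at distance <= gap."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        subset = rng.sample(range(1, n + 1), size)  # uniform subset without replacement
        if min_gap(subset) <= gap:
            failures += 1
    return failures / trials

# Example: n = 10**6, k = 3, so |S| = 100 and n**(1 - 2/k) = 100; try c_2 = 0.01.
print(estimate_failure_rate(n=10**6, size=100, gap=1, trials=200))
```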

Theorem 1. Estimating $F_k$ up to a factor 2 in the random-order data-stream model with probability
at least 9/10 requires $\Omega(n^{1-2.5/k}/\log n)$ bits of space.

Proof. Let $\{S_1, \ldots, S_t\}$ be an instance of $t$-party set-disjointness where $|S_i| = c_1 n^{1-3/(2k)} =: w$,
$N = c_1 n^{1-1/(2k)}$, and $t = 100 n^{1/k}$. Consider $t$ players where the $i$-th player knows $S_i$. Let $A$ be an
$s$-space, single-pass, data-stream algorithm that returns a 2-approximation for $F_k$ of a $\Theta(n)$-length
stream with probability 99/100 on the assumption that the order of the stream elements is chosen
uniformly at random from the set of all orderings.

The players use $A$ to solve the instance of $t$-party set-disjointness as follows. Using public
randomness the players pick:

(1) Sets $I_1 = [a_1, b_1], \ldots, I_t = [a_t, b_t]$ from $\mathrm{Cycle}_{n,w}$ (without loss of generality $b_i < b_j$ if $i < j$).

(2) A permutation $\sigma$ of $[2n]$.

(3) A length $n^{2/k}/c_2$ random binary string $r$.

If $b_1 < w$ or there exists some $j \in [n]$ that appears in three of the intervals, the protocol terminates
with failure. Note that the probability of this event is at most $(w-1)/n + 1/100 < 2/100$ by Lemma 1.
Given the sets $I_1, \ldots, I_t$ we define the intervals

$$A_i = \begin{cases} [a_{i+1}, b_i] & \text{if } b_i \geq a_{i+1} \\ [b_i + 1, a_{i+1} - 1] & \text{if } b_i < a_{i+1} \end{cases}$$

where $b_0 = 0$ and $a_{t+1} = n + 1$. We say $A_i$ is a doubled interval if $A_i = I_i \cap I_{i+1}$ and call it an easy
interval otherwise. Let $B_i = I_i \setminus (I_{i-1} \cup I_{i+1})$. Then the $A_i$'s and $B_i$'s are disjoint and,

$$[n] = A_0 \cup B_1 \cup A_1 \cup \ldots \cup B_t \cup A_t.$$
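
The decomposition of $[n]$ into the pieces $A_0, \ldots, A_t$ and $B_1, \ldots, B_t$ can be sketched in code. The version below assumes the boundary conventions $b_0 = 0$ and $a_{t+1} = n + 1$, intervals of equal length sorted by right endpoint, no wrap-around (the protocol aborts when $b_1 < w$), and no point covered by three intervals. The names `decompose` and `covers_exactly` are hypothetical.

```python
def decompose(intervals, n):
    """Split [n] into pieces A_0..A_t and B_1..B_t.

    `intervals` is a list of inclusive (a_i, b_i) pairs sorted by b_i.
    Each piece is an inclusive (lo, hi) pair; lo > hi denotes an empty piece.
    """
    t = len(intervals)
    a = [None] + [lo for lo, _ in intervals] + [n + 1]  # a[1..t], assumed a[t+1] = n + 1
    b = [0] + [hi for _, hi in intervals]               # assumed b[0] = 0, then b[1..t]
    A, doubled = [], []
    for i in range(t + 1):
        if b[i] >= a[i + 1]:        # I_i and I_{i+1} overlap: doubled interval
            A.append((a[i + 1], b[i]))
            doubled.append(True)
        else:                       # gap between I_i and I_{i+1}: easy interval
            A.append((b[i] + 1, a[i + 1] - 1))
            doubled.append(False)
    B = []
    for i in range(1, t + 1):
        lo = max(a[i], b[i - 1] + 1)  # drop the part shared with I_{i-1}
        hi = min(b[i], a[i + 1] - 1)  # drop the part shared with I_{i+1}
        B.append((lo, hi))
    return A, B, doubled

def covers_exactly(pieces, n):
    """True if the inclusive pieces are pairwise disjoint and their union is {1, ..., n}."""
    points = [x for lo, hi in pieces for x in range(lo, hi + 1)]
    return sorted(points) == list(range(1, n + 1))

A, B, doubled = decompose([(2, 6), (5, 9), (12, 16)], n=20)
print(A)        # [(1, 1), (5, 6), (10, 11), (17, 20)]
print(B)        # [(2, 4), (7, 9), (12, 16)]
print(doubled)  # [False, True, False, False]
print(covers_exactly(A + B, 20))  # True
```

In this example only $A_1 = I_1 \cap I_2 = [5, 6]$ is doubled; every other $A_i$ is a gap between consecutive intervals.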

Also consider a partitioning of $[n]$ into $n^{2/k}/c_2$ intervals $C_j$ of length $w_2 = c_2 n^{1-2/k}$, $[n] = \bigcup_j C_j$.

Player $i$ constructs a string $s^i$ consisting of the elements from $S_i$ in a random order, with $\sigma$
applied to each. The $j$-th entry of the constructed stream is determined by

(1) Player $i$ if $j \in A_i$ where $A_i$ is an easy interval. The element is set to $\sigma(n + j)$.

(2) Player $i$ if $j \in B_i$. The element is set to $\sigma(n + j)$ with probability 1/2, and to $s^i_\ell$ otherwise,
where $j$ is the $\ell$-th element of $I_i$.

(3) Player $i$ if $j \in A_i \cap C_m$ and $r_m = 0$ where $A_i$ is a doubled interval. The element is set to
$s^i_\ell$ where $j$ is the $\ell$-th element of $I_i$.

(4) Player $i + 1$ if $j \in A_i \cap C_m$ and $r_m = 1$ where $A_i$ is a doubled interval. The element is
set to $s^{i+1}_\ell$ where $j$ is the $\ell$-th element of $I_{i+1}$.

By appealing to Lemma 1 we note that the players can simulate an algorithm on this stream
using $O(n^{1/k})$ messages with probability at least 99/100. The size of each message is at most $s$.

Hence the space use of the algorithm must be at least $\Omega(n^{1-2.5/k}/\log n)$. With probability at
least 99/100, the multiplicity of the most frequent element of the stream is greater than $(2n)^{1/k}$.
Hence, in the case that there exists $j \in [n]$ such that $j \in S_i$ for all $i \in [t]$, $F_k > 2n$ and otherwise
$F_k \leq n$. Therefore a 2-approximation of $F_k$ solves the instance of $t$-party set-disjointness. It
remains to show that the ordering of the stream is near random so that we may assume that $A$
returns a 2-approximation of $F_k$, as required. This follows because the locations of the multiply
occurring element (if one exists) were chosen by picking $t$ random positions and then deleting each
occurrence independently with probability 1/2 (by Lemma 2 we may condition on the fact that
no two elements occur within $w_2$ of each other). Hence the probability that the protocol succeeds
is at least $99/100 - 2/100 - 1/100 - 1/100 = 19/20$. □
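
The arithmetic at the end of the proof can be checked with exact fractions. The sketch below reads the three subtracted failure probabilities as 2/100, 1/100, and 1/100, which is an assumption based on the stated total of 19/20. It also checks, for one illustrative choice of $n$ and $k$, that a single element of multiplicity above $(2n)^{1/k}$ already forces $F_k > 2n$. Both function names are hypothetical.

```python
from fractions import Fraction

def success_lower_bound(base, failure_probabilities):
    """Subtract each failure probability from the base success probability (union bound)."""
    return base - sum(failure_probabilities, Fraction(0))

def heavy_element_forces_gap(n, k, multiplicity):
    """True if one element with this multiplicity already pushes F_k above 2n."""
    return multiplicity ** k > 2 * n

bound = success_lower_bound(
    Fraction(99, 100),
    [Fraction(2, 100), Fraction(1, 100), Fraction(1, 100)],
)
print(bound)  # 19/20
print(heavy_element_forces_gap(n=1000, k=3, multiplicity=13))  # True, since 13**3 = 2197 > 2000
```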